The VC Dimension
Roadmap
(1) When Can Machines Learn?
(2) Why Can Machines Learn?

    Lecture 6: Theory of Generalization
    $E_{\text{out}} \approx E_{\text{in}}$ possible
    if $m_{\mathcal{H}}(N)$ breaks somewhere and $N$ large enough

    Lecture 7: The VC Dimension
    • Definition of VC Dimension
    • VC Dimension of Perceptrons
    • Physical Intuition of VC Dimension
    • Interpreting VC Dimension

(3) How Can Machines Learn?
(4) How Can Machines Learn Better?
Definition of VC Dimension


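Recap: More on Growth Function

$m_{\mathcal{H}}(N)$ of break point $k$:

$$m_{\mathcal{H}}(N) \le B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i}, \qquad \text{highest term } N^{k-1}$$

provably & loosely, for $N \ge 2$, $k \ge 3$,

$$m_{\mathcal{H}}(N) \le B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i} \le N^{k-1}$$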
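The loose polynomial bound above can be checked numerically. The following is a minimal sketch, assuming only the formula for $B(N, k)$ stated on this slide; the function names `B` and `check_bound` and the search limits are illustrative choices, not part of the lecture.

```python
from math import comb

def B(N, k):
    """Bounding function from the slide: sum of C(N, i) for i = 0 .. k-1."""
    return sum(comb(N, i) for i in range(k))

def check_bound(max_N=30, max_k=10):
    """True if B(N, k) <= N**(k-1) for all 2 <= N <= max_N and 3 <= k <= max_k."""
    return all(B(N, k) <= N ** (k - 1)
               for N in range(2, max_N + 1)
               for k in range(3, max_k + 1))

# Example: B(5, 3) = 1 + 5 + 10 = 16, while 5**(3-1) = 25.
print(B(5, 3), 5 ** 2)
print(check_bound())
```

This only confirms the inequality on a finite grid; the slide's claim for all $N \ge 2$, $k \ge 3$ still rests on the proof from Lecture 6.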





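Recap: More on Vapnik-Chervonenkis (VC) Bound
For any $g = \mathcal{A}(\mathcal{D}) \in \mathcal{H}$ and ‘statistical’ large $\mathcal{D}$, for $N \ge 2$, $k \ge 3$,

$$\begin{aligned}
\mathbb{P}_{\mathcal{D}}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right]
&\le \mathbb{P}_{\mathcal{D}}\left[\,\exists h \in \mathcal{H} \text{ s.t. } \left|E_{\text{in}}(h) - E_{\text{out}}(h)\right| > \epsilon\,\right] \\
&\le 4\, m_{\mathcal{H}}(2N) \exp\left(-\tfrac{1}{8}\epsilon^2 N\right) \\
&\le 4\,(2N)^{k-1} \exp\left(-\tfrac{1}{8}\epsilon^2 N\right) \qquad \text{(if $k$ exists)}
\end{aligned}$$

if (1) $m_{\mathcal{H}}(N)$ breaks at $k$ (good $\mathcal{H}$)
   (2) $N$ large enough (good $\mathcal{D}$)

==> probably generalized ‘$E_{\text{out}} \approx E_{\text{in}}$’, and

if (3) $\mathcal{A}$ picks a $g$ with small $E_{\text{in}}$ (good $\mathcal{A}$)

==> probably learned! (:-) good luck)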


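VC Dimension
the formal name of maximum non-break point

Definition
VC dimension of $\mathcal{H}$, denoted $d_{\text{VC}}(\mathcal{H})$, is
largest $N$ for which $m_{\mathcal{H}}(N) = 2^N$

• the most inputs that $\mathcal{H}$ can shatter
• $d_{\text{VC}}$ = ‘minimum $k$’ $- 1$

$N \le d_{\text{VC}}$ ==> $\mathcal{H}$ can shatter some $N$ inputs
$k > d_{\text{VC}}$ ==> $k$ is a break point for $\mathcal{H}$

if $N \ge 2$, $d_{\text{VC}} \ge 2$, $m_{\mathcal{H}}(N) \le N^{d_{\text{VC}}}$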


The Four VC Dimensions

• positive rays: $m_{\mathcal{H}}(N) = N + 1$
  $d_{\text{VC}} = 1$

• positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
  $d_{\text{VC}} = 2$

• convex sets: $m_{\mathcal{H}}(N) = 2^N$
  $d_{\text{VC}} = \infty$

• 2D perceptrons: $m_{\mathcal{H}}(N) \le N^3$ for $N \ge 2$
  $d_{\text{VC}} = 3$

good: finite $d_{\text{VC}}$
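For the two one-dimensional hypothesis sets, the growth function and the VC dimension can be recovered by brute force. The sketch below assumes that, for rays and intervals on a line, the number of dichotomies depends only on the ordering of the points, so the points $0, 1, \ldots, N-1$ stand in for any $N$ distinct inputs. The function names and the cap `max_N` are illustrative.

```python
def cut_points(xs):
    """Candidate thresholds: below all points, between neighbors, above all points."""
    return [xs[0] - 1] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [xs[-1] + 1]

def rays(xs):
    """Dichotomies of positive rays: +1 to the right of threshold a, -1 otherwise."""
    return {tuple(+1 if x > a else -1 for x in xs) for a in cut_points(xs)}

def intervals(xs):
    """Dichotomies of positive intervals: +1 strictly inside (l, r), -1 elsewhere."""
    cuts = cut_points(xs)
    return {tuple(+1 if l < x < r else -1 for x in xs)
            for l in cuts for r in cuts if l <= r}

def vc_dimension(dichotomies, max_N=8):
    """Largest N <= max_N for which all 2**N dichotomies appear on N points."""
    d = 0
    for N in range(1, max_N + 1):
        if len(dichotomies(list(range(N)))) == 2 ** N:
            d = N
    return d

for N in range(1, 6):
    xs = list(range(N))
    print(N, len(rays(xs)), len(intervals(xs)))
print(vc_dimension(rays), vc_dimension(intervals))
```

The counts can be compared by hand against $N + 1$ and $\frac{1}{2}N^2 + \frac{1}{2}N + 1$. Convex sets and 2D perceptrons need a geometric separability check and are left out of this sketch.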


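VC Dimension and Learning

finite $d_{\text{VC}}$ ==> $g$ ‘will’ generalize ($E_{\text{out}}(g) \approx E_{\text{in}}(g)$)

• regardless of learning algorithm $\mathcal{A}$
• regardless of input distribution $P$
• regardless of target function $f$

[Figure: learning flow. The unknown target function $f: \mathcal{X} \to \mathcal{Y}$ and the unknown $P$ on $\mathcal{X}$ generate the training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$; the learning algorithm $\mathcal{A}$ picks the final hypothesis $g \approx f$ from the hypothesis set $\mathcal{H}$.]

‘worst case’ guarantee on generalization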


Fun Time
Suppose there is a set of $N$ inputs that cannot be shattered by $\mathcal{H}$. Based only on this information, what can we conclude about $d_{\text{VC}}(\mathcal{H})$?
(1) $d_{\text{VC}}(\mathcal{H}) > N$
(2) $d_{\text{VC}}(\mathcal{H}) = N$
(3) $d_{\text{VC}}(\mathcal{H}) < N$
(4) no conclusion can be made

Reference Answer: (4)

It is possible that there is another set of $N$ inputs that can be shattered, which means $d_{\text{VC}} \ge N$. It is also possible that no set of $N$ inputs can be shattered, which means $d_{\text{VC}} < N$. Neither case can be ruled out by one non-shattering set.
VC Dimension of Perceptrons


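2D PLA Revisited

• linearly separable $\mathcal{D}$ ==> PLA can converge ==> ($T$ large) $E_{\text{in}}(g) = 0$
• $\mathbf{x}_n \sim P$ and $y_n = f(\mathbf{x}_n)$ ==> $\mathbb{P}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \le \ldots$ by $d_{\text{VC}} = 3$ ==> ($N$ large) $E_{\text{out}}(g) \approx E_{\text{in}}(g)$

together: $E_{\text{out}}(g) \approx 0$ :-)

general PLA for $\mathbf{x}$ with more than 2 features?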
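The first half of the argument above (separable data, PLA halts, $E_{\text{in}}(g) = 0$) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the target weights `w_f`, the margin used to reject points near the boundary, the seed, and the helper names. The update rule is the usual PLA correction $\mathbf{w} \leftarrow \mathbf{w} + y\,\mathbf{x}$.

```python
import random

def sign(v):
    return 1 if v > 0 else -1

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def make_data(n, w_f=(0.1, 1.0, -1.0), margin=0.1, seed=0):
    """Sample n labeled 2D points x = (1, x1, x2); w_f is a hypothetical target."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        x = (1.0, rng.uniform(-1, 1), rng.uniform(-1, 1))
        s = dot(w_f, x)
        if abs(s) >= margin:              # keep a margin so PLA halts quickly
            data.append((x, sign(s)))
    return data

def pla(data, max_updates=10_000):
    """Perceptron learning algorithm: fix one misclassified example per update."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_updates):
        wrong = next(((x, y) for x, y in data if sign(dot(w, x)) != y), None)
        if wrong is None:                 # no mistakes left
            break
        x, y = wrong
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def e_in(w, data):
    """Fraction of training examples that w misclassifies."""
    return sum(sign(dot(w, x)) != y for x, y in data) / len(data)

data = make_data(100)
g = pla(data)
print(g, e_in(g, data))
```

The second half of the argument, $E_{\text{out}}(g) \approx E_{\text{in}}(g)$, is what the VC bound with $d_{\text{VC}} = 3$ supplies; the code does not test it.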


VC Dimension of Perceptrons

• 1D perceptron (pos/neg rays): $d_{\text{VC}} = 2$
• 2D perceptrons: $d_{\text{VC}} = 3$
    • $d_{\text{VC}} \ge 3$: [figure]
    • $d_{\text{VC}} \le 3$: [figure]
• $d$-D perceptrons: $d_{\text{VC}} \overset{?}{=} d + 1$

two steps:
• $d_{\text{VC}} \ge d + 1$
• $d_{\text{VC}} \le d + 1$





Extra Fun Time

What statement below shows that $d_{\text{VC}} \ge d + 1$?
(1) There are some $d + 1$ inputs we can shatter.
(2) We can shatter any set of $d + 1$ inputs.
(3) There are some $d + 2$ inputs we cannot shatter.
(4) We cannot shatter any set of $d + 2$ inputs.

Reference Answer: (1)

$d_{\text{VC}}$ is the maximum $N$ for which $m_{\mathcal{H}}(N) = 2^N$, and $m_{\mathcal{H}}(N)$ is the largest number of dichotomies of $N$ inputs. So if we can find $2^{d+1}$ dichotomies on some $d + 1$ inputs, $m_{\mathcal{H}}(d + 1) = 2^{d+1}$ and hence $d_{\text{VC}} \ge d + 1$.


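There are some $d + 1$ inputs we can shatter.

• some ‘trivial’ inputs:

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \mathbf{x}_3^T \\ \vdots \\ \mathbf{x}_{d+1}^T \end{bmatrix}
= \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \ddots & \vdots \\
1 & 0 & 0 & \cdots & 1
\end{bmatrix}$$

• visually in 2D: [figure]

note: $X$ invertible!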


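Can We Shatter $X$?

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_{d+1}^T \end{bmatrix}
= \begin{bmatrix}
1 & 0 & \cdots & 0 \\
1 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 0 & \cdots & 1
\end{bmatrix} \quad \text{invertible}$$

to shatter: for any $\mathbf{y} = (y_1, \ldots, y_{d+1})^T$, find $\mathbf{w}$ such that

$$\mathrm{sign}(X\mathbf{w}) = \mathbf{y} \;\Longleftarrow\; X\mathbf{w} = \mathbf{y} \;\overset{X \text{ invertible}}{\Longleftrightarrow}\; \mathbf{w} = X^{-1}\mathbf{y}$$

‘special’ $X$ can be shattered ==> $d_{\text{VC}} \ge d + 1$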
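The shattering argument can be run directly for small $d$. This is a sketch under one simplification: instead of a general matrix inverse, it uses the closed-form solution of $X\mathbf{w} = \mathbf{y}$ for the ‘trivial’ inputs, namely $w_0 = y_1$ and $w_i = y_{i+1} - y_1$. The function names are illustrative.

```python
from itertools import product

def trivial_inputs(d):
    """Rows of X: x_1 = (1, 0, ..., 0) and x_{i+1} = (1, e_i) for i = 1 .. d."""
    rows = [[1] + [0] * d]
    for i in range(d):
        rows.append([1] + [1 if j == i else 0 for j in range(d)])
    return rows

def solve_w(y):
    """Solve X w = y for the trivial X: w_0 = y_1 and w_i = y_{i+1} - y_1."""
    return [y[0]] + [yi - y[0] for yi in y[1:]]

def shatters(d):
    """Check that every y in {-1, +1}^(d+1) is produced exactly as X w."""
    X = trivial_inputs(d)
    for y in product([-1, +1], repeat=d + 1):
        w = solve_w(y)
        Xw = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
        if any(v != t for v, t in zip(Xw, y)):
            return False
    return True

print(solve_w((+1, -1, +1)))
print([shatters(d) for d in range(1, 6)])
```

Because $X\mathbf{w}$ equals $\mathbf{y}$ exactly, $\mathrm{sign}(X\mathbf{w}) = \mathbf{y}$ follows for every one of the $2^{d+1}$ labelings, which is what shattering requires.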


Extra Fun Time

What statement below shows that $d_{\text{VC}} \le d + 1$?
(1) There are some $d + 1$ inputs we can shatter.
(2) We can shatter any set of $d + 1$ inputs.
(3) There are some $d + 2$ inputs we cannot shatter.
(4) We cannot shatter any set of $d + 2$ inputs.

Reference Answer: (4)

$d_{\text{VC}}$ is the maximum $N$ for which $m_{\mathcal{H}}(N) = 2^N$, and $m_{\mathcal{H}}(N)$ is the largest number of dichotomies of $N$ inputs. So if we cannot find $2^{d+2}$ dichotomies on any $d + 2$ inputs (i.e. $d + 2$ is a break point), $m_{\mathcal{H}}(d + 2) < 2^{d+2}$ and hence $d_{\text{VC}} < d + 2$. That is, $d_{\text{VC}} \le d + 1$.


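$d_{\text{VC}} \le d + 1$

A 2D Special Case

[Figure: four 2D inputs; $\mathbf{x}_1$ is labeled ×, $\mathbf{x}_2$ and $\mathbf{x}_3$ are labeled ∘, and $\mathbf{x}_4$ is marked ?]

? cannot be ×:

$$\mathbf{w}^T\mathbf{x}_4 = \underbrace{\mathbf{w}^T\mathbf{x}_2}_{\circ} + \underbrace{\mathbf{w}^T\mathbf{x}_3}_{\circ} - \underbrace{\mathbf{w}^T\mathbf{x}_1}_{\times} > 0$$

linear dependence restricts dichotomy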
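A quick numerical experiment illustrates the restriction. The coordinates below are an assumption: they take the 2D ‘trivial’ inputs from the earlier slide for $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and set $\mathbf{x}_4 = \mathbf{x}_2 + \mathbf{x}_3 - \mathbf{x}_1$, which is the dependence the equation above relies on. The sampling range, seed, and function names are illustrative.

```python
import random

# Assumed inputs, each with x_0 = 1 prepended; x_4 = x_2 + x_3 - x_1.
X = {1: (1, 0, 0), 2: (1, 1, 0), 3: (1, 0, 1), 4: (1, 1, 1)}

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def count_violations(trials=20_000, seed=0):
    """Sample random w; among those labeling (x_1, x_2, x_3) as (x, o, o),
    count how many label x_4 as x (score <= 0)."""
    rng = random.Random(seed)
    seen = violations = 0
    for _ in range(trials):
        w = [rng.uniform(-1, 1) for _ in range(3)]
        s = {i: score(w, x) for i, x in X.items()}
        if s[1] < 0 and s[2] > 0 and s[3] > 0:
            seen += 1
            if s[4] <= 0:
                violations += 1
    return seen, violations

seen, violations = count_violations()
print(seen, violations)
```

Random sampling is not a proof, but it matches the algebra: whenever $\mathbf{w}$ realizes (×, ∘, ∘) on the first three inputs, the fourth score is a sum of three positive quantities.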
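$d_{\text{VC}} \le d + 1$

$d$-D General Case

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_{d+1}^T \\ \mathbf{x}_{d+2}^T \end{bmatrix}$$

more rows than columns: linear dependence (some $a_i$ non-zero)

$$\mathbf{x}_{d+2} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \ldots + a_{d+1}\mathbf{x}_{d+1}$$

• can you generate $(\mathrm{sign}(a_1), \mathrm{sign}(a_2), \ldots, \mathrm{sign}(a_{d+1}), \times)$? if so, what $\mathbf{w}$?

$$\mathbf{w}^T\mathbf{x}_{d+2} = a_1\mathbf{w}^T\mathbf{x}_1 + a_2\mathbf{w}^T\mathbf{x}_2 + \ldots + a_{d+1}\mathbf{w}^T\mathbf{x}_{d+1} > 0 \quad \text{(contradiction!)}$$

Each term with $a_i \ne 0$ is positive because $\mathbf{w}^T\mathbf{x}_i$ has the same sign as $a_i$, so $\mathbf{x}_{d+2}$ cannot be labeled ×.

‘general’ $X$ no-shatter ==> $d_{\text{VC}} \le d + 1$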


Fun Time

Based on the proof above, what is $d_{\text{VC}}$ of 1126-D perceptrons?

Reference Answer: $d_{\text{VC}} = d + 1 = 1127$
Well, too much fun for this section! :-)
Physical Intuition of VC Dimension


Degrees of Freedom

[Figure omitted (modified from the work of Vermeiren)]

• hypothesis parameters $\mathbf{w} = (w_0, w_1, \cdots, w_d)$:
  creates degrees of freedom

• hypothesis quantity $M = |\mathcal{H}|$:
  ‘analog’ degrees of freedom

• hypothesis ‘power’ $d_{\text{VC}} = d + 1$:
  effective ‘binary’ degrees of freedom

$d_{\text{VC}}(\mathcal{H})$: powerfulness of $\mathcal{H}$
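Two Old Friends

Positive Rays ($d_{\text{VC}} = 1$)
[Figure: $h(x) = -1$ to the left of the threshold $a$, $h(x) = +1$ to the right]
free parameters: $a$

Positive Intervals ($d_{\text{VC}} = 2$)
[Figure: $h(x) = +1$ inside the interval between $\ell$ and $r$, $h(x) = -1$ outside]
free parameters: $\ell$, $r$

practical rule of thumb:
$d_{\text{VC}} \approx$ #free parameters (but not always)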


$M$ and $d_{\text{VC}}$

copied from Lecture 5 :-)
(1) can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
(2) can we make $E_{\text{in}}(g)$ small enough?

small $M$
(1) Yes!, $\mathbb{P}[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$
(2) No!, too few choices

large $M$
(1) No!, $\mathbb{P}[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$
(2) Yes!, many choices

small $d_{\text{VC}}$
(1) Yes!, $\mathbb{P}[\text{BAD}] \le 4 \cdot (2N)^{d_{\text{VC}}} \cdot \exp(\ldots)$
(2) No!, too limited power

large $d_{\text{VC}}$
(1) No!, $\mathbb{P}[\text{BAD}] \le 4 \cdot (2N)^{d_{\text{VC}}} \cdot \exp(\ldots)$
(2) Yes!, lots of power

using the right $d_{\text{VC}}$ (or $\mathcal{H}$) is important

Hsuan-Tien Lin (NTU CSIE), Machine Learning Foundations

Fun Time

Origin-crossing hyperplanes are essentially perceptrons with $w_0$ fixed at 0. Make a guess about the $d_{\text{VC}}$ of origin-crossing hyperplanes in $\mathbb{R}^d$.

Reference Answer: $d_{\text{VC}} = d$

The proof is almost the same as proving the $d_{\text{VC}}$ for usual perceptrons, but it is the intuition ($d_{\text{VC}} \approx$ #free parameters) that you shall use to answer this quiz.
Interpreting VC Dimension


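VC Bound Rephrase: Penalty for Model Complexity
For any $g = \mathcal{A}(\mathcal{D}) \in \mathcal{H}$ and ‘statistical’ large $\mathcal{D}$, for $N \ge 2$, $d_{\text{VC}} \ge 2$,

$$\underbrace{\mathbb{P}_{\mathcal{D}}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right]}_{\text{BAD}} \le \underbrace{4(2N)^{d_{\text{VC}}} \exp\left(-\tfrac{1}{8}\epsilon^2 N\right)}_{\delta}$$

Rephrase
..., with probability $\ge 1 - \delta$, GOOD: $\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| \le \epsilon$

$$\begin{aligned}
\text{set} \quad \delta &= 4(2N)^{d_{\text{VC}}} \exp\left(-\tfrac{1}{8}\epsilon^2 N\right) \\
\frac{\delta}{4(2N)^{d_{\text{VC}}}} &= \exp\left(-\tfrac{1}{8}\epsilon^2 N\right) \\
\ln\left(\frac{4(2N)^{d_{\text{VC}}}}{\delta}\right) &= \tfrac{1}{8}\epsilon^2 N \\
\sqrt{\frac{8}{N}\ln\left(\frac{4(2N)^{d_{\text{VC}}}}{\delta}\right)} &= \epsilon
\end{aligned}$$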


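Rephrase
..., with probability $\ge 1 - \delta$, GOOD!

gen. error: $$\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| \le \sqrt{\frac{8}{N}\ln\left(\frac{4(2N)^{d_{\text{VC}}}}{\delta}\right)}$$

$$E_{\text{in}}(g) - \sqrt{\frac{8}{N}\ln\left(\frac{4(2N)^{d_{\text{VC}}}}{\delta}\right)} \;\le\; E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N}\ln\left(\frac{4(2N)^{d_{\text{VC}}}}{\delta}\right)}$$

$$\underbrace{\sqrt{\ldots}}_{\Omega(N, \mathcal{H}, \delta)}: \text{ penalty for model complexity}$$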
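The penalty term is easy to evaluate. The sketch below assumes the square-root expression above, with $\mathcal{H}$ entering only through $d_{\text{VC}}$; the function name `omega` and the sample values are illustrative. The logarithm is expanded as $\ln 4 + d_{\text{VC}} \ln(2N) - \ln \delta$ to avoid forming a huge intermediate number.

```python
from math import log, sqrt

def omega(N, d_vc, delta):
    """Penalty sqrt((8 / N) * ln(4 * (2N)**d_vc / delta))."""
    return sqrt(8.0 / N * (log(4.0) + d_vc * log(2.0 * N) - log(delta)))

# The penalty shrinks as N grows (d_vc and delta held fixed) ...
for N in (100, 1_000, 10_000, 100_000):
    print(N, round(omega(N, d_vc=3, delta=0.1), 3))

# ... and grows with d_vc (N and delta held fixed).
for d_vc in (1, 3, 10, 30):
    print(d_vc, round(omega(10_000, d_vc=d_vc, delta=0.1), 3))
```

Reading the output as $E_{\text{out}}(g) \le E_{\text{in}}(g) + \Omega$ shows how loose the guarantee is for small $N$: a penalty larger than 1 says nothing about a classification error rate.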


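THE VC Message

with a high probability,

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \underbrace{\sqrt{\frac{8}{N}\ln\left(\frac{4(2N)^{d_{\text{VC}}}}{\delta}\right)}}_{\Omega(N, \mathcal{H}, \delta)}$$

[Figure: Error versus VC dimension $d_{\text{VC}}$, with curves for out-of-sample error, model complexity, and in-sample error, and the best $d^*_{\text{VC}}$ marked on the horizontal axis]

• $d_{\text{VC}} \uparrow$: $E_{\text{in}} \downarrow$ but $\Omega \uparrow$
• $d_{\text{VC}} \downarrow$: $\Omega \downarrow$ but $E_{\text{in}} \uparrow$
• best $d^*_{\text{VC}}$ in the middle

powerful $\mathcal{H}$ not always good!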


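VC Bound Rephrase: Sample Complexity
For any $g = \mathcal{A}(\mathcal{D}) \in \mathcal{H}$ and ‘statistical’ large $\mathcal{D}$, for $N \ge 2$, $d_{\text{VC}} \ge 2$,

$$\underbrace{\mathbb{P}_{\mathcal{D}}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right]}_{\text{BAD}} \le \underbrace{4(2N)^{d_{\text{VC}}} \exp\left(-\tfrac{1}{8}\epsilon^2 N\right)}_{\delta}$$

given specs $\epsilon = 0.1$, $\delta = 0.1$, $d_{\text{VC}} = 3$, want $4(2N)^{d_{\text{VC}}} \exp\left(-\tfrac{1}{8}\epsilon^2 N\right) \le \delta$

    N          bound
    100        $2.82 \times 10^{7}$
    1,000      $9.17 \times 10^{9}$
    10,000     $1.19 \times 10^{8}$
    100,000    $1.65 \times 10^{-38}$
    29,300     $9.99 \times 10^{-2}$

sample complexity: need $N \approx 10{,}000\, d_{\text{VC}}$ in theory

practical rule of thumb:
$N \approx 10\, d_{\text{VC}}$ often enough!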
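The table above can be regenerated from the bound itself. This is a minimal sketch using the slide's specs ($\epsilon = 0.1$, $\delta = 0.1$, $d_{\text{VC}} = 3$); the function names and the step size of the search are illustrative assumptions. The bound is evaluated in log space so that very large or very small values do not overflow.

```python
from math import exp, log

def vc_bound(N, d_vc=3, eps=0.1):
    """4 * (2N)**d_vc * exp(-eps**2 * N / 8), evaluated in log space."""
    return exp(log(4.0) + d_vc * log(2.0 * N) - eps ** 2 * N / 8.0)

def smallest_N(delta=0.1, d_vc=3, eps=0.1, step=100):
    """First multiple of `step` at which the bound drops to delta or below.

    The bound starts above delta for tiny N, rises to a peak, then decays,
    so a forward scan stops at the first crossing on the way down.
    """
    N = step
    while vc_bound(N, d_vc, eps) > delta:
        N += step
    return N

for N in (100, 1_000, 10_000, 100_000, 29_300):
    print(f"{N:>7,}  {vc_bound(N):.2e}")
print(smallest_N())
```

With a step of 100 the scan lands on the 29,300 row of the table, which is where the rough ‘$N \approx 10{,}000\, d_{\text{VC}}$’ reading for $d_{\text{VC}} = 3$ comes from.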


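Looseness of VC Bound

$$\mathbb{P}_{\mathcal{D}}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \le 4(2N)^{d_{\text{VC}}} \exp\left(-\tfrac{1}{8}\epsilon^2 N\right)$$

theory: $N \approx 10{,}000\, d_{\text{VC}}$; practice: $N \approx 10\, d_{\text{VC}}$

• Hoeffding for unknown $E_{\text{out}}$: any distribution, any target
• $m_{\mathcal{H}}(N)$ instead of $|\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)|$: ‘any’ data
• $N^{d_{\text{VC}}}$ instead of $m_{\mathcal{H}}(N)$: ‘any’ $\mathcal{H}$ of same $d_{\text{VC}}$
• union bound on worst cases: any choice made by $\mathcal{A}$

but hardly better, and ‘similarly loose for all models’

philosophical message of VC bound
important for improving ML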


Fun Time

Consider the VC Bound below. How can we decrease the probability of getting BAD data?

$$\mathbb{P}_{\mathcal{D}}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \le 4(2N)^{d_{\text{VC}}} \exp\left(-\tfrac{1}{8}\epsilon^2 N\right)$$

(1) decrease model complexity $d_{\text{VC}}$
(2) increase data size $N$ a lot
(3) increase generalization error tolerance $\epsilon$
(4) all of the above

Reference Answer: (4)

Congratulations on being
Master of VC bound! :-)
Summary
(1) When Can Machines Learn?
(2) Why Can Machines Learn?

    Lecture 6: Theory of Generalization

    Lecture 7: The VC Dimension
    • Definition of VC Dimension
      maximum non-break point
    • VC Dimension of Perceptrons
      $d_{\text{VC}}(\mathcal{H}) = d + 1$
    • Physical Intuition of VC Dimension
      $d_{\text{VC}} \approx$ #free parameters
    • Interpreting VC Dimension
      loosely: model complexity & sample complexity

    • next: more than noiseless binary classification?

(3) How Can Machines Learn?
(4) How Can Machines Learn Better?